67 million natural product-like compound database generated via molecular language processing

Natural products are a rich resource of bioactive compounds for valuable applications across multiple fields such as food, agriculture, and medicine. For natural product discovery, high throughput in silico screening offers a cost-effective alternative to traditional resource-heavy assay-guided exploration of structurally novel chemical space. In this data descriptor, we report a characterized database of 67,064,204 natural product-like molecules generated using a recurrent neural network trained on known natural products, demonstrating a significant 165-fold expansion in library size over the approximately 400,000 known natural products. This study highlights the potential of using deep generative models to explore novel natural product chemical space for high throughput in silico discovery.


Background & Summary
Nature produces natural products of immense chemical diversity 1,2 . A vast assortment of molecular scaffolds are produced by organisms to interact with their environment and to engage in chemical warfare with each other. This natural diversity has also been leveraged for wide-ranging applications such as in agricultural pesticides to increase food production 3 , food preservatives to facilitate distribution and storage 4,5 , and most prominently as therapeutic agents to treat diseases [6][7][8] . Indeed, it has been estimated that approximately 80% of all clinically used antibiotics can trace their origins to a natural product 6 .
Despite nature's potential for providing valuable molecules, assay-guided natural product discovery has been a low-yielding investment since the golden age of discovery in the 1960s 9 . After the initial wave of uncovering structurally unique and accessible natural product chemical space, subsequent efforts to venture into less accessible chemical space or to "rediscover" known natural product classes for novel applications have been met with limited success 10 . Tremendous effort must be invested in the biosynthesis, curation and characterization of natural product libraries, resulting in only ∼400,000 fully characterized natural products known to date 11 . The significant financial and resource requirements of assay-guided investigations have also resulted in a broad dampening of commercial interest surrounding natural product discovery 12 . However, the advent of deep generative modelling 13 and high throughput in silico screening 14 presents an opportunity to circumvent traditional time-consuming, costly, and experimentally-driven natural product discovery and to reformulate it as a computationally-driven inverse design problem 15 . The potential of such an approach would also scale with the increasing size and availability of natural product databases 16 , growing alongside the trend of digitalization in chemical research 17 . In this data descriptor, we report an expansive, curated database 18 of 67,064,204 natural product-like molecules generated via an in silico pipeline (Fig. 1), representing a significant 165-fold expansion over the ∼400,000 known natural products 11 . We envision in silico structural generation playing an integral role in the future of natural product discovery 19 .
In contrast to manually curated natural product libraries, deep generative models transcend the boundaries of human-dependent molecular design to significantly expand chemical search space by orders of magnitude while concurrently reducing financial and resource requirements 20,21 . Some examples of deep generative architectures that have been employed for de novo molecular design include variational autoencoders (VAE) 22,23 , recurrent neural networks (RNN) [24][25][26] , and generative adversarial networks (GAN) [27][28][29] , with each adopting a different strategy with their own strengths and weaknesses 30 . The SMILES-based (Simplified Molecular Input Line Entry System) 31 RNN architecture with long short-term memory (LSTM) units was favoured in this work for its demonstrated ability to robustly generate novel and chemically diverse molecular entities in a low data regime 32 . A systematic benchmarking study 33 reported that SMILES-based LSTM generated 95.9% valid molecular structures, a significant improvement over VAE (87.0%) and GAN (37.9%) based architectures.
Here, we trained an LSTM model 24 on tokenized SMILES (with stereochemistry removed) from 325,535 (80%) out of the 406,919 known natural products in COCONUT, the collection of open natural products (https://coconut.naturalproducts.net/, accessed 1 Aug 2022) 11 . The model was able to break down SMILES into unique tokens (e.g. C, N, S, O, c, n, 1, 2, etc.), learn how to assemble these tokens together according to the molecular language of natural products, and generate 100 million natural product-like SMILES with no specified stereochemistry 34 . Although stereochemistry in natural products can confer specific bioactivity 35 , our pipeline removes stereochemistry to reduce data complexity, lower file size, and improve fidelity of the generated structural database. In any case, a range of feasible stereoisomers for each molecule can still be obtained through iterative enumeration of its 3D structures 36,37 followed by back transformation to stereospecific SMILES 38 . Following this approach, extended isomer libraries of shortlisted SMILES of interest can be generated to cover wider isomeric space than a database of pre-defined stereospecific SMILES.
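As a rough illustration of the tokenization step described above, the following sketch splits a SMILES string into tokens with a regular expression and collects the resulting vocabulary. The pattern, the function names tokenize and build_vocabulary, and the example molecules are assumptions made for illustration; the tokenizer actually used to train the model may differ.

```python
import re

# Hypothetical token pattern (an assumption, not the pipeline's own tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [nH] or [N+]
    r"|Br|Cl"                # two-letter atoms of the organic subset
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|[A-Za-z]"             # single-letter atoms (aromatic atoms are lowercase)
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#\-+()/\\.:~*$@]"   # bonds, branches, and other punctuation
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def build_vocabulary(smiles_list):
    """Collect the sorted set of unique tokens seen across a list of SMILES."""
    return sorted({token for smiles in smiles_list for token in tokenize(smiles)})


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    print(tokenize(aspirin))
    print(build_vocabulary([aspirin, "c1cc[nH]c1Cl"]))
```

Checking that the joined tokens reproduce the input guards against silently dropping characters that the pattern does not cover.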
Although alternative approaches for the generation of natural product virtual libraries have been attempted 39,40 , prior libraries have been limited in terms of novelty (frequent re-occurrence of well-known scaffolds) 38 , natural product-likeness (43% meeting threshold compared to 85% in the training set) 39 , and scale (<1.5 million molecules) 39,40 . Moreover, these previously generated natural product virtual libraries have not been publicly released. In this data descriptor, we present an openly available virtual library 18 of >67 million natural product-like SMILES with a distribution of natural product-likeness scores similar to that of known natural products (Fig. 2) yet encompassing expanded physiochemical and structural space, indicating its potential for in silico discovery of natural products.
First, the RDKit 36 function Chem.MolFromSmiles() was used to filter out 9,596,585 syntactically invalid SMILES from the 100 million generated set. Second, to ensure molecular uniqueness within the dataset, the RDKit functions Chem.MolToSmiles(Chem.MolFromSmiles()) and Chem.inchi.MolToInchi() were used to convert the generated SMILES into canonical SMILES and International Chemical Identifier (InChI) representations for comparison and filtering, resulting in the removal of 22,484,883 (22%) duplicates (Fig. 2a). Third, the ChEMBL chemical curation pipeline 41 was applied for further sanitization and standardization by: (1) checking and validating chemical structures, assigning an error score if structural issues are detected, with error scores increasing with the severity of the problem; (2) standardizing chemical structures based on FDA/IUPAC guidelines 44 ; and (3) generating parent structures by removing isotopes, solvents, and salts. Through this process, a further 854,328 invalid molecules with penalty scores exceeding 5 (indicating severe structural issues) were filtered out. Combined with the earlier detected syntactically invalid SMILES, a total of 10,450,913 (11%) invalid generated SMILES were identified and removed (Fig. 2a). The top 2 structural errors reported amongst the remaining valid molecules were (1) undefined stereochemistry (95%), which was due to the generation of SMILES without stereochemistry, and (2) the need for (de)protonation (2%), which was addressed later in Step 3 of the ChEMBL chemical curation pipeline. On the whole, these pre-processing steps refined the initial dataset down to this work's reported database 18 of 67,064,204 (67%, Fig. 2a) valid, unique, generated natural product-like SMILES.
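The counts reported for these filtering steps can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in the text; the variable names are illustrative and not taken from the released code.

```python
# Counts quoted in the text; variable names are illustrative assumptions.
sampled = 100_000_000          # SMILES sampled from the trained model
syntax_invalid = 9_596_585     # rejected by Chem.MolFromSmiles()
duplicates = 22_484_883        # removed after canonical SMILES / InChI comparison
curation_invalid = 854_328     # penalty score above 5 in the ChEMBL pipeline
known_products = 406_919       # known natural products in COCONUT

after_parsing = sampled - syntax_invalid
after_dedup = after_parsing - duplicates
final_count = after_dedup - curation_invalid
total_invalid = syntax_invalid + curation_invalid
fold_expansion = final_count / known_products

if __name__ == "__main__":
    print(f"after parsing:       {after_parsing:,}")
    print(f"after deduplication: {after_dedup:,}")
    print(f"final database:      {final_count:,}")
    print(f"total invalid:       {total_invalid:,}")
    print(f"fold expansion:      {fold_expansion:.1f}")
```

The three removals sum exactly to the difference between the 100 million sampled SMILES and the 67,064,204 retained molecules, and the ratio to the known set rounds to the reported 165-fold expansion.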
Fourth, RDKit was used to calculate natural product-likeness scores (NP Score) 42 for both known natural product SMILES and generated SMILES (Fig. 2b). NP Score employs atom-centred fragments (HOSE codes) 45 and bonding information to characterize structural features and calculate a Bayesian measure of molecular similarity to known natural product structural space 42 . The NP Score distribution of the generated natural product-like SMILES was found to closely resemble that of known natural products from the COCONUT database (Fig. 2b) with a Kullback-Leibler (KL) divergence of 0.064 nats, supporting that natural product-like molecules had been generated.
Fifth, the NPClassifier 43 toolkit was used to classify both natural product-like SMILES generated from the trained model and known natural product SMILES from the COCONUT database (Fig. 2c). NPClassifier 43 is a deep learning tool that considers structural features (counted Morgan fingerprints) 46 , taxonomy of the producing organism, nature of the biosynthetic pathway, and biological activity to characterize molecules in a holistic natural product classification framework. Despite this, 7,779,787 (12%) of the generated valid SMILES received no pathway classification, a larger fraction than the 35,708 (9%) of the known natural product SMILES that also received no pathway classification. It has been reported 43 that deficiencies in NPClassifier can be traced back to limitations in its training data, as the model relies on existing knowledge of natural products to classify molecules based on structural similarities. The comparatively higher percentage of generated SMILES with no NPClassifier pathway class suggests the presence of either synthetic structural features or novel natural product class(es). However, similarities in the natural product-likeness score distributions of the generated and known datasets (KL divergence of 0.064 nats) suggest promising potential toward the latter. The remaining 59,284,417 (88%) of the generated valid natural product-like SMILES were annotated with a distribution of biosynthetic pathways comparable to that of known natural products from the COCONUT database, with a KL divergence of 0.047 nats.
Finally, to describe physiochemical space covered by known natural products in the COCONUT database versus the >67 million natural product-like generated database, 10 physiochemical molecular descriptors for each molecule were calculated using RDKit 36 . T-distributed stochastic neighbour embedding (t-SNE) dimensionality reduction of the 10 calculated molecular descriptors into two-dimensional space was performed and plotted to visualize both physiochemical and structural space coverage (Fig. 3a).
The t-SNE 2D comparison shows a significant increase in physiochemical space covered by generated SMILES (Fig. 3a), indicating the presence of structurally novel natural product-like molecules in the generated database. Density plots (Fig. 3b,c) showing the concentration of structures across the t-SNE 2D projected space also highlight the significantly expanded structural space offered by the generated database even in overlapping physiochemical space (Fig. 3c). Overall, this workflow has enabled us to generate a significantly expanded database 18 of 67,064,204 characterized natural product-like molecules, greatly increasing natural product chemical space by 165-fold over the currently estimated 400,000 natural products known 11 . The >67 million natural product-like compound database 18 along with supporting files for the reproduction of this work has been made available on figshare 18 (see Data Records, Table 1). To facilitate usage, the structure and organization of the reported database has also been provided (see Supplementary Table S1).
As an indication of its cost efficiency, the total computation time for training and sampling was less than 24 hours on an Intel 8268 48-Cores @ 2.9 GHz Nvidia V100 (VRAM = 32 GB and RAM = 192 GB) compute node. The estimated price for similar computing resources on Amazon Web Services (https://calculator.aws/, accessed 23 March 2023), namely 24 hours of a dedicated instance (Amazon EC2, c5n.18xlarge instance, 72 vCPUs, 192 GiB memory, Asia-Pacific (Singapore) region, 100 gigabit network performance), is USD$155. In comparison, a commercially available library of 2,576 natural products is priced two orders of magnitude higher at USD$33,513 (https://www.selleckchem.com/screening/natural-product-library.html, accessed 23 March 2023). Computationally generated natural product databases such as the one reported here are well positioned to push the boundaries of known natural product structures, provide expanded search spaces, and act as a key enabling resource to progress the next generation of in silico high throughput screening methods for natural product discovery.

Methods
Molecule generation. All software programs were implemented in Python (v3.6.10) with PyTorch (v1.1.0) on an Intel 8268 48-Cores @ 2.9 GHz Nvidia V100 (VRAM = 32 GB and RAM = 192 GB) compute node running on an RHEL 8.3 operating system. The details of all other dependencies can be found in the following environment.yml file (https://github.com/SIBERanalytics/Natural-Product-Generator/blob/master/environment.yml). The generative model was trained with a recurrent neural network (RNN) architecture using long short-term memory (LSTM) units (https://github.com/skinnider/low-data-generative-models). To assemble the training and held-out datasets, the COCONUT collection of open natural products (https://coconut.naturalproducts.net/, accessed 1 Aug 2022) 11 was filtered to remove invalid SMILES and stereochemistry. This filtered COCONUT dataset was then split into 3 portions: 292,981 (72%) for training, 32,554 (8%) for validation, and 81,384 (20%) as a held-out dataset for testing. The combined training and validation dataset (80% of the filtered COCONUT dataset) was augmented 10-fold with the respective non-canonical SMILES using SmilesEnumerator (http://github.com/EBjerrum/SMILES-enumeration) prior to RNN training. This has been shown to improve the validity of the SMILES sampled from the trained model 24 . The vocabulary of the known natural products was determined by deconstructing SMILES strings into elemental tokens (e.g. C, N, S, O, c, n, 1, 2, etc.). The network consists of 3 RNN layers with a hidden layer dimension of 512 and no dropout. The network was trained with a batch size of 128, a learning rate of 0.001, the Adam optimizer, and max epochs set at 1,000. An early stopping patience of 10,000 minibatches was employed. A total of 100,000,000 SMILES strings were sampled from the trained model (with a best validation loss of 0.55) after completion of model training.
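The three-way split described above can be sketched as follows. The shuffling strategy, the seed, and the function name split_dataset are assumptions for illustration only; the text reports the resulting partition sizes but not the exact procedure used to obtain them.

```python
import random


def split_dataset(items, test_frac=0.20, val_frac=0.08, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    Both fractions are taken relative to the full dataset; the training
    portion is whatever remains. The seed is an arbitrary assumption.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


if __name__ == "__main__":
    # Integer stand-ins for the 406,919 filtered COCONUT SMILES.
    train, validation, test = split_dataset(range(406_919))
    print(len(train), len(validation), len(test))
```

With 406,919 items, these fractions give partitions of the same sizes as those reported (292,981 for training, 32,554 for validation, and 81,384 held out).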

Kullback-Leibler (KL) Divergence.
A measure of the statistical distance between the property probability distributions of known natural product SMILES and generated natural product-like SMILES was calculated with SciPy (v1.7.3) using the function scipy.special.rel_entr(P,Q). This is also described by the following equation:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \ln\left(\frac{P(x)}{Q(x)}\right)$$

where P(x) = probability of known natural product SMILES having value x for a given property and Q(x) = probability of generated natural product-like SMILES having value x for a given property.

t-SNE dimensionality reduction was performed with the following parameters: n_components = 2, init = "pca", random_state = 7. Seaborn (v0.11.2) histplot function was used with the following parameters: bins = 50, vmin = 0, vmax = 100,000 to generate structural density maps from the t-SNE data of the generated and known SMILES.
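For readers who want to see the KL divergence computation spelled out, the following standard-library sketch sums the same element-wise terms that scipy.special.rel_entr(P,Q) returns, using the natural logarithm so that the result is in nats. The helper names normalise and kl_divergence and the toy histogram counts are assumptions for illustration; the binning used in this work is not specified here.

```python
import math


def normalise(counts):
    """Convert histogram counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [count / total for count in counts]


def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats for two discrete distributions.

    Terms follow the rel_entr convention: 0 when P(x) is 0, and
    infinity when P(x) > 0 but Q(x) is 0.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue
        if q_x == 0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total


if __name__ == "__main__":
    # Toy histograms standing in for a binned property of each dataset.
    known = normalise([10, 40, 35, 15])
    generated = normalise([12, 38, 33, 17])
    print(f"KL divergence: {kl_divergence(known, generated):.4f} nats")
```

Because the divergence is not symmetric, the order of the arguments matters: here the known natural products play the role of P and the generated molecules the role of Q, matching the definitions above.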

Data records
The 67,064,204 natural product-like compound database generated via molecular language processing in this work has been deposited on figshare (Table 1) 18 . The database is organized in a single, two-dimensional array flat model format where elements in each column are the same type of data for a given molecular descriptor and elements in the same row relate to the same molecule. There are a total of 38 columns (i.e. 38 descriptors for each molecule) and 67,064,204 rows (i.e. 67,064,204 molecules in the database). The column numbering, names, data types, and descriptions are listed in Supplementary Table S1.

Technical Validation
Testing of generated natural product-like molecules. From the 406,919 known, valid, unique, canonical, natural product SMILES strings in the COCONUT 11 database with stereochemistry removed, 81,384 (20%) were held-out and the remaining 325,535 (80%) were used to train and validate the recurrent neural network to generate natural product-like SMILES. Of the 81,384 known natural products that were held out as a test set from the training dataset, 30,229 (37% of held-out set) known natural products were reproduced in the generated natural product-like SMILES database, confirming the trained model can generate actual natural product molecules. In addition, the natural product likeness scores (NP Score) 42 and NPClassifier 43 pathway distributions of the generated natural product-like molecules have low KL divergence scores of 0.064 and 0.047 nats respectively when referenced against the observed distributions of known natural products from the COCONUT database 11 , indicating that natural product-like molecules have been generated.
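The held-out recovery check described above amounts to a set intersection between two collections of SMILES. The sketch below assumes both collections have already been canonicalized in the same way with stereochemistry removed; the function name recovery_rate and the toy strings are illustrative assumptions.

```python
def recovery_rate(held_out, generated):
    """Return (number recovered, fraction recovered) of held-out SMILES.

    Both inputs are assumed to contain canonical SMILES produced by the
    same canonicalization procedure, so that string equality implies the
    same molecule.
    """
    held_out_set = set(held_out)
    if not held_out_set:
        raise ValueError("held-out set is empty")
    recovered = held_out_set & set(generated)
    return len(recovered), len(recovered) / len(held_out_set)


if __name__ == "__main__":
    # Toy example with placeholder strings.
    held_out = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
    generated = ["CCO", "c1ccccc1", "CCCC", "CCO"]
    count, fraction = recovery_rate(held_out, generated)
    print(count, fraction)

    # Figures reported in the text: 30,229 of 81,384 held-out molecules.
    print(f"{30_229 / 81_384:.1%}")
```

Using sets makes the comparison insensitive to duplicates in either collection.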

Usage Notes
This generated natural product-like SMILES database, covering novel physiochemical and structural space, may serve as a starting point for high throughput in silico discovery of functional natural products. Aside from potential food, agrochemical, and therapeutic applications, there has been increasing consumer demand for natural product alternatives to synthetic ingredients for their perceived health and wellness benefits 49,50 . Such natural alternatives are also amenable to sustainable manufacturing processes via synthetic biology approaches 51,52 , adding to their attractiveness as an answer from chemical manufacturers to environmental regulators 53 on issues of climate change, pollution, and resource depletion 54 .

Code availability
Code used to train the molecular language model as well as the trained model used for natural product-like molecule generation is available from GitHub at https://github.com/SIBERanalytics/Natural-Product-Generator.